Support 24 more languages, including JSON, Kotlin, XML, YAML etc... #33

Merged
merged 23 commits into from
Jul 25, 2021

Conversation

yoeo
Owner

@yoeo yoeo commented Jun 29, 2021

Support the following languages:

  • Assembly
  • CSV
  • Dart
  • Fortran
  • Groovy
  • INI
  • JSON
  • Julia
  • Kotlin
  • Pascal (currently broken ⚠️)
  • TOML
  • TypeScript
  • VBA
  • XML
  • YAML

Prediction accuracy is 92.59%, but the training and test datasets were not well balanced because some languages lacked files.
There were also errors in the Pascal dataset.

@yoeo
Owner Author

yoeo commented Jun 29, 2021

Prediction results with 167k test files:
[image: prediction results with 167k test files]

@yoeo yoeo mentioned this pull request Jun 29, 2021
@TylerLeonhardt

This is great @yoeo! I did notice some decrease in confidence for Java. The following snippet used to have over 60% confidence:

public class PositiveNegative {

    public static void main(String[] args) {

        double number = 12.3;

        // true if number is less than 0
        if (number < 0.0)
            System.out.println(number + " is a negative number.");

        // true if number is greater than 0
        else if ( number > 0.0)
            System.out.println(number + " is a positive number.");

        // if both test expression is evaluated to false
        else
            System.out.println(number + " is 0.");
    }
}

but using this branch, the confidence that it's Java is down to 20%. My guess is that the introduction of Groovy hurt the confidence?

@yoeo
Owner Author

yoeo commented Jul 15, 2021

Nice catch @TylerLeonhardt.
You're probably right about the effects of Groovy support on Java detection.

This model is still "work in progress" and I hope that training it with more examples and for a longer time will help improve its predictions.

@TylerLeonhardt

@yoeo the JSON and YAML predictions were great, btw. Such a game changer :)

I hope to have this in a VS Code Insider release either this week or next. Exciting times!

@yoeo
Owner Author

yoeo commented Jul 20, 2021

Hi, I updated the model.
It now uses a way more balanced and clean dataset. It also supports even more languages than before (44 → 53 languages).
⚠️ But this model is barely trained ⚠️
I still need to train it for many hours and maybe tweak it a little to improve its accuracy before merging it.

image

@yoeo
Owner Author

yoeo commented Jul 20, 2021

@TylerLeonhardt

I investigated the confidence drop that you noticed.
Indeed, adding more languages hurts the prediction confidence.
Fortunately, the model still assigns the highest probability value to the correct language 91% of the time.

For example, here are box plots of the probabilities that I got by testing 5k Java files:

  • using the model that is on the main branch
    [image: Java box plot]

  • and using the model that is on this PR
    [image: Java box plot]

We can see that the addition of Groovy and Dart hurts Java detection confidence, but almost all the time the files are still correctly detected as Java files.
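The difference between "still detected as Java" and "detected with lower confidence" can be sketched as follows. This is only an illustration: the helper name `top1_accuracy_and_confidence` and every probability below are made up, not taken from the box plots.

```python
from statistics import median
from typing import Dict, List, Tuple


def top1_accuracy_and_confidence(
    predictions: List[Dict[str, float]], expected: str
) -> Tuple[float, float]:
    """Return (share of files whose top language is `expected`,
    median probability assigned to `expected`)."""
    winners = [max(p, key=p.get) for p in predictions]
    accuracy = sum(w == expected for w in winners) / len(predictions)
    confidence = median(p[expected] for p in predictions)
    return accuracy, confidence


# Hypothetical probabilities for three Java files, before and after
# adding a close relative such as Groovy (values are invented).
before = [
    {"Java": 0.85, "C#": 0.15},
    {"Java": 0.70, "C#": 0.30},
    {"Java": 0.40, "C#": 0.60},
]
after = [
    {"Java": 0.45, "Groovy": 0.35, "C#": 0.20},
    {"Java": 0.40, "Groovy": 0.35, "C#": 0.25},
    {"Java": 0.30, "Groovy": 0.25, "C#": 0.45},
]

acc_before, conf_before = top1_accuracy_and_confidence(before, "Java")
acc_after, conf_after = top1_accuracy_and_confidence(after, "Java")
```

In this toy data the share of correct top predictions is the same for both sets, while the median Java probability drops, which is the pattern described above.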

The probability plots for all the languages are available here:

@TylerLeonhardt

@yoeo this is amazing work! I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence.

For example: 30% Java and <1% everything else means it's probably Java.

I don't know if 30%/1% is the best pair of numbers...but I'll give it a go. I'm open to suggestions from you since you're the expert 😃
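A minimal sketch of that relative rule, assuming the probabilities arrive as a mapping from language name to probability. The function name `clear_winner` and its parameters are hypothetical, and the 30%/1% defaults are just the tentative numbers mentioned above, not tuned values.

```python
from typing import Dict, Optional


def clear_winner(
    probabilities: Dict[str, float],
    min_top: float = 0.30,        # tentative value from the comment above
    max_runner_up: float = 0.01,  # tentative value from the comment above
) -> Optional[str]:
    """Return the top language only if it is far ahead of the runner-up."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    top_name, top_probability = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_probability >= min_top and runner_up < max_runner_up:
        return top_name
    return None


# Made-up inputs: one clear winner, one close race.
clear = clear_winner({"Java": 0.30, "Groovy": 0.009, "C#": 0.008})
close = clear_winner({"Java": 0.45, "Groovy": 0.30, "C#": 0.05})
```

With these inputs, `clear` is `"Java"` and `close` is `None`, because the second case has a strong runner-up even though its top probability is higher.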

@yoeo yoeo changed the title from "Support 14 more languages, including JSON, Kotlin, XML, YAML etc..." to "Support 24 more languages, including JSON, Kotlin, XML, YAML etc..." Jul 24, 2021
@yoeo
Owner Author

yoeo commented Jul 24, 2021

Hi @TylerLeonhardt

The model is now fully trained. Its overall accuracy is pretty good at ~93.5% (the original model's accuracy was ~93.8%).
The confidence scores increased a bit compared to the untrained model that I pushed earlier.
For example, your sample code is now detected with ~41% confidence:

echo "public class PositiveNegative {
....
}" | guesslang --probabilities
Language name       Probability
 Java                 41.63%
 Groovy               24.83%
 C#                    6.17%
 ...

I'm pretty happy with these results and I'll merge this PR after updating the documentation.


I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence.
For example: 30% Java and <1% everything else means it's probably Java.

You're perfectly right, I think.
In fact I use a variant of this solution to check if there is a clear winner or not:

from statistics import mean, stdev
from typing import List


def _is_reliable(probabilities: List[float]) -> bool:
    """Arbitrary rule to determine if the prediction is reliable:
    The predicted language probability must be higher than
    2 standard deviations from the mean.
    """
    threshold = mean(probabilities) + 2*stdev(probabilities)
    predicted_language_probability = max(probabilities)
    return predicted_language_probability > threshold

And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule 🙂
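To see how that rule behaves on the Java sample, here is a small worked sketch. The function is a copy of the rule quoted above under the name `is_reliable`; the three top probabilities are the ones printed by `guesslang --probabilities`, while the 50 remaining entries are an assumption (the leftover probability spread evenly) made only so the list sums to 1.

```python
from statistics import mean, stdev
from typing import List


def is_reliable(probabilities: List[float]) -> bool:
    """Copy of the rule quoted above: the top probability must exceed
    the mean by more than 2 standard deviations."""
    threshold = mean(probabilities) + 2 * stdev(probabilities)
    return max(probabilities) > threshold


# Top three values come from the output above; the rest is hypothetical.
top_three = [0.4163, 0.2483, 0.0617]
others_count = 50  # assumed number of remaining languages
rest = [(1 - sum(top_three)) / others_count] * others_count
sample = top_three + rest

sample_is_reliable = is_reliable(sample)

# A perfectly flat distribution has zero spread, so nothing stands out.
flat_is_reliable = is_reliable([0.25] * 4)
```

Under these assumptions the threshold comes out at roughly 15%, so the 41.63% Java score counts as reliable, whereas the flat distribution does not. Note that `stdev` needs at least two values.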

Thanks.

@TylerLeonhardt

And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule 🙂

😁 interesting! Thanks for sharing. I think I'll try to make sure my solution aligns with that and with what you're already doing.

Excited to see this change go in!


3 participants